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Abstract 



mensional data analysis with highly correlated predictors. We propose a new method 



> 

^ . 

CN Identifying homogeneous subgroups of variables can be challenging in high di- 

o 

■ called Hexagonal Operator for Regression with Shrinkage and Equality Selection, 

m ■ 

HORSES for short, that simultaneously selects positively correlated variables and 

identifies them as predictive clusters. This is achieved via a constrained least-squares 

^ . problem with regularization that consists of a linear combination of an Li penalty 

H ■ 

_Cy _' for the coefficients and another Li penalty for pairwise differences of the coefficients. 

This specification of the penalty function encourages grouping of positively correlated 
predictors combined with a sparsity solution. We construct an efficient algorithm 
to implement the HORSES procedure. We show via simulation that the proposed 
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method outperforms other variable selection methods in terms of prediction error and 
parsimony. The technique is demonstrated on two data sets, a small data set from 
analysis of soil in Appalachia, and a high dimensional data set from a near infrared 
(NIR) spectroscopy study, showing the flexibility of the methodology. 

Keywords and Phrases: Prediction; Regularization; Spatial correlation; Supervised 
clustering; Variable selection 



1 Introduction 

Suppose that we observe {xi,yi), . . . , {xn, Un) where Xi = {xn, . . . , XipY is a p-dimensional 
predictor and tji is the response variable. We consider a standard linear model for each of 
n observations 

P 

yi = ^ Pj^ij + for z = 1, . . . , n, 
.?=1 

with E(ej) = and Var(ei) = cr^. We also assume the predictors are standardized and the 

response variable is centered, 

n n n 

^yi = 0, ^Xjj = and = 1 for j = 1, . . . ,p. 

i=l i=l i=l 

With the dramatic increase in the amount of data collected in many fields comes a 
corresponding increase in the number of predictors p available in data analyses. For simpler 
interpretation of the underlying processes generating the data, it is often desired to have a 
relatively parsimonious model. It is often a challenge to identify important predictors out 
of the many that are available. This becomes more so when the predictors are correlated. 



As a motivating example, consid er a study invo 



data measurements of cookie dough (jOsborne et al 



ying n ear infrared (NIR) spectroscopy 



19841 ). Near infrared reflectance spec- 



tral measurements were made at 700 wavelengths from 1100 to 2498 nanometers (nm) in 
steps of 2nm for each of 72 cookie doughs made with a standard recipe. The study aims 
to predict dough chemical composition using the spectral characteristics of NIR reflectance 
wavelength measurements. Here, the number of wavelengths p is much bigger than the 
sample size n. 

Many methods have been developed to address this issue of high dimensionality. Section 
[3] contains a brief review. Most of these methods involve minimizing an objective function, 
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like the negative log-likelihood, subject to certain constraints, and the methods in Section 
|3] mainly differ in the constraints used. 

In this paper, we propose a variable selection procedure that can cluster predictors using 
the positive correlation structure and is also applicable to data with p > n. The constraints 
we use balance between an Li norm of the coefficients and an Li norm for pairwise differ- 
ences of the coefficients. We call this procedure a Hexagonal Operator for Regression with 
Shrinkage and Equality Selection, HORSES for short, because the constraint region can be 
represented by a hexagon. The hexagonal shape of the constraint region focuses selection 
of groups of predictors that are positively correlated. 

The goal is to obtain a homogeneous subgroup structure within the high dimensional 
predictor space. This grouping is done by focusing on spatial and/or positive correlation 
in the predictors, similar to supervised clustering. The benefits of our procedure are a 
combination of variance reduction and higher predictive power. 

The remainder of the paper is organized as follows. We introduce the HORSES proce- 
dure and its geometric interpretation in Section [2J We provide an overview of some other 
methods in Section [31 relating our procedure with some of these methods. In Section H] we 
describe the computational algorithm that we constructed to apply HORSES to data and 
address the issue of selection of the tuning parameters. A simulation study is presented in 
Section [51 Two data analyses using HORSES are presented in Section [61 We conclude the 
paper with discussion in Section [3 



In this section we describe our method for variable selection for regression with positively 
correlated predictors. Our penalty terms involve a linear combination of an Li penalty for 
the coefficients and another Li penalty for pairwise differences of coefficients. Computation 
is done by solving a constrained least-squares problem. Specifically, estimates for the 
HORSES procedure are obtained by solving 



2 Model 



p 

argmin \\y — l^j^jW'^ subject to 



p 




(1) 



j<fc 
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(a) Elastic Net (b) OSCAR (c) HORSES 

Figure 1: Graphical representation of the constraint region in the (/3i,/32) plane for 



Elastic Net, (b) OSCAR, and (c) HORSES 



with d^^ < a < 1 and is a thresholding parameter. 

As we describe in Section [3l some methods like Elastic Net and OSCAR can group 
correlated predictors, but they can also put negatively correlated predictors into the same 
group. Our method's novelty is its grouping of positively correlated predictors in addition 
to achieving a sparsity solution. Figure Wic) shows the hexagonal shape of the constraint 
region induced by ([1]), showing schematically the tendency of the procedure to equalize 
coefficients only in the direction oi y = x. 

The lower bound d"^ of a prevents the estimates from being a solution only via the 
second penalty function, so the HORSES method always achieves sparsity. We recommend 
d = y/p, where p is the number of predictors. This ensures that the constraint parameter 
region lies between that of the Li norm and of the Elastic Net method, i.e. the set of 
possible estimates for the HORSES procedure is a subset of that of Elastic Net. In other 
words, HORSES accounts for positive correlations up to the level of Elastic Net. With 
d = p, the HORSES parameter region lies within that of the OSCAR method. 

In a graphical representation in the {^1,^2) plane, the solution is the first time the 
contours of the sum of squares loss function hit the constraint regions. Figure 2 gives 



a schematic view. Figure 2(c) shows the solution for HORSES when there is negative 
correlation between predictors. HORSES treats them separately by making /3i = 0. On 
the other hand, HORSES yields Pi = when predictors are positively correlated, as in 



Figure 2(d) 



The following theorem shows that HORSES has the exact grouping property. As the 
correlation between two predictors increases, the predictors are more likely to be g r ouped 



together. Our proof follows closely the proof of Theorem 1 in iBondell and Reich! (120081 ) 
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(a) (b) (c) (d) 

Figure 2: Graphical representation in the (/3i,/32) plane. HORSES solutions are the first 
time the contours of the sum of squares function hit the hexagonal constraint region. 



Contours centered at OLS estimate with a negative correlation. Solution occurs at Pi = 0; 



(d) Contours centered at OLS estimate with a positive correlation. Solution occurs at 



/3i = /32. 

and is hence relegated to an Appendix. 

Theorem 1. Let Ai = Xa and A2 = A(l — a) be the two tuning parameters in the HORSES 
criterion. Given data (y, X) with centered response y and standardized predictors X = 
(xi, . . . ,XpY, let /3(Ai, A2) he the HORSES estimate using the tuning parameters (Ai, A2). 
Let pi^i = x^xi be the sample correlation between covariates xj, and x;. 

For a given pair of predictors xj, and x/, suppose that both /3fc(Ai, A2) and /3;(Ai, A2) are 
distinct from the other f3m- Then there exists Aq > such that if X > Xq then 

^k{^1^^2) = A2), for all a € 1]. 

Furthermore, it must be that 



Xo<\\y\\y/2{l-pki) (l-a) 



The strength with which the predictors are grouped is controlled by A2. If A2 is in- 
creased, any two coefficients are morely likely to be equal. When Xi and xj are positively 
correlated, Theorem 1 implies that predictors i and j will be grouped and their coefficient 
estimates almost identical. 
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3 Related work 



This brief review cannot do justice to the many variable selection methods that have been 
developed. We highlight several of them, especially those that have links to our HORSES 
procedure. 

While variable selection in regression is an increasingly important problem, it is also 
very challenging, particularly when there is a large number of highly correlated predictors. 
Since the important contribution of t he least absolute shrinkage and selection operator 
(LASSO) method by iTibshiranil (119961 ). many other methods based on regularized or pe- 
nalized regression have been proposed for parsimonious model selection, particularly in 



high dimensions , e.g. 



Zou and Hastie 



2005 



lastic Net, Fused LASSO, OSCAR and Group Pursuit methods 



Tibshirani et al. 



2005 



Bondell and Reich 



2008 



Shen and Huangl . 



2010h 



Briefly, these methods involve penalization to fit a model to data, resulting in 
shrinkage of the estimators. Many methods have focused on addressing various possible 
shortcomings of the LASSO method, for instance when there is dependence or collinearity 
between predictors. 

In the LASSO, a bound is imposed on the sum of the absolute values of the coefficients: 

_ P 

/3 



argmm \\y 

/3 



P 
J=l 



subject to^^ \(3j\ < t, 



where y = {yi, . . . ,yn) and xj = {xij, 



nj 



The LASSO method is a shrinkage method like ridge regression (iHoerl and Kennardl . 



1970l ). with automatic variable selection. Due to the nature of the Li penalty term, LASSO 



shrinks each coefficient and selects variables simultaneously. However, a major drawback 
of LASSO is that if there exists collinearity among a subset of the predictors, it usually 
only selects one to represent the entire coUinear group. Furthermore, LASSO cannot select 
more than n variables when p > n. 

One possible approach is to cluster predictors based on the correla tion structure and 
to use averages of the predictors in each cluster as new predictors. iPark et al.l (120071 ) 
used this approach for gene expression data analysis and introduce the concept of a super 
gene. However, it is sometimes desirable to keep all relevant predictors separate while 
achieving better predictive perfor mance, rath e r tha n to use an average of the predictors. 

(120071 ) for grouping does not account for the 



The hierarchical clustering used in 
correlation structure of the predictors. 



Park et al. 
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Other penalized regression methods have also been proposed for grouped predictors 



Bondell and Reich. 



2008 



Tibshirani et al. 



2005 



Zou and Hastie 



2005 



Shen and Huangl . 



20101). All these methods except Group Pursuit work by introducing a new penalty term 



in addition to the Li penalty term of LASSO to account for correlation structure. For 
example, based on the fact that ridge regression tends to shrink the correlated predictors 



toward each other, Elastic Net (IZou and Hastid . l2005l ) uses a linear combination of ridge 
and LASSO penalties for group predictor selection and can be computed by solving the 
following constrained least squares optimization problem. 



argmm 1 1 y 



subject to a | + (1 — a) /3| < t. 



The second term forces highly correlated predictors to be averaged while the first term 
lea ds to a sparse solut i on of these averaged predictors. 

Bondell and Reich! ( 2008 ) proposed OSCAR (Octagonal Shrinkage and Clustering Al- 



gorithm for Regression), which is defined by 

P 



(3 



argmin 1 1 y 



p p 

subject to \f3j\ + c max{|/3j|, \(3ij\} < t. 

j=l j<k 



By using a pairwise Lqq norm as the second penalty term, OSCAR encourages equality of 
coefficients. The constraint region for the OSCAR procedure is represented by an octagon 
(see Figure [^b)|. Unlike the hexagonal shape of the HORSES procedure, the octagonal 
shape of the constraint region allows for grouping of negatively as well as positively cor- 
related predictors. While this is not necessarily an undesirable property, there may be 
instances when a separation of positively and nega tively correlated predict ors is preferred. 

Unlike Elastic Net and OSCAR, Fused LASSO (ITibshirani et al.l . l2005l ) was introduced 
to account for spatial correlation of predictors. A key assumption in Fused LASSO is that 
the predictors have a certain type of ordering. Fused LASSO solves 

P P P 



argmm | \y 

/3 



subject to \ f3j\ < ti and — /3j-i| < ^2- 



The second constraint, called a fusion penalty, encourages sparsity in the differences of 
coefficients. The method can theoretically be extended to multivariate data, although with 
a corresponding increase in computational requirements. 
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Note that the Fused LASSO signal approximator (FLSA) in lFriedman et ahl ( 120071 ) can 
be considered as a special case of HORSES with design matrix X = I. We also want to point 
out that our penalty function is a convex combination of the Li norm of the coefficients and 
the Li norm of the pairwise differenc es of coefficients. Therefore, it is not a straightforward 
extension of Fused LASSO in which each penalty function is constrained separately. 



She 



( I2OIOI ) extended Fused LASSO by considering all possible pairwise differences and called 
it Clustered LASSO. However, the constraint region of Clustered LASSO does not have a 



hexagonal shape. As a res ult, C 



uster ed LASSO does not have the exact grouping property 



of OSCAR. Consequently, IShd ( I2OIOI ) suggested to use a data-argumentation modification 



such as Elastic Net to achieve exact grou ping 



Finally, the Group Pursuit method of IShen and Huang] (120101 ) is a kind of supervised 
clustering. With a regularization parameter t and a threshold parameter A2, they define 



and estimate (3 using 



argmm \\y 



G{z) 



P 

j=l 



A2 if \z\ > A2 
\z\ otherwise. 



subject to G{(3j 
j<k 



HORSES is a hybrid of the Group Pursuit and Fused LASSO methods and addresses 
some limitations of the various methods described above. For example, OSCAR cannot 
handle the high-dimensional data while Elastic Net does not have the exact grouping prop- 
erty. 



4 Computation and Tuning 

A crucial component of any variable selection procedure is an efficient algorithm for its 
implementation. In this Section we describe how we developed such an algorithm for the 
HORSES procedure. The Matlab code for this algorithm is available upon request. We 
also discuss here the choice of optimal tuning parameters for the algorithm. 
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4.1 Computation 

Solving the equations for the HORSES procedure ([1]) is equivalent to solving its Lagrangian 
counterpart 

P P 

fm = ^\\y -T^f^j^jW^ + ^1 E + ^2 E \f^j - f^ki (2) 

j=l j=l j<k 

where Ai = Xa and A2 = A(l — a) with A > 0. 

To solve ([2]) to obtain estimates for the HORSE S proc edure, we modify the path- 



wise coordinate descent algorithm of iFriedman et al.l (120071 ) . The pathwise coordinate 



descent algorithm is an adaptation of the coordinate- wise descent algorithm for solving the 
2-dimensional Fused LASSO problem with a non-separable penalty (objective) function. 



Our extension involves modifying the pathwise coor dinate descent a. 



gorith m to solve the 



regression problem with a fusion penalty. As shown in IFriedman et al.l (120071 ). the proposed 
algorithm is much faster than a general quadratic program solver. Furthermore, it allows 
the HORSES procedure to run in situations where p > n. 

Our modified pathwise coordinate descent algorithm has two steps, the descent and the 
fusion steps. In the descent step, we run an ordinary coordinate-wise descent procedure 
to sequentially update each parameter given the others. The fusion step is considered 
when the descent step fails to improve the objective function. In the fusion step, we add 
an equality constraint on pairs of /3^s to take into account potential fusions and do the 
descent step along with the constraint. In other words, the fusion step moves given pairs 
of parameters together under equality constraints to improve the objective function. The 
details of the algorithm are as follows: 

• Descent step: 

The derivative of ([2]) with respect to f3f^ given 13 j = /3j, j 7^ k, is 



9fil3) To ( ^ o \T 



k-1 p 
+hsgn{^k) + h^sgn0j - ^k) + X2 ^ sgn{(3k-Pj), (3) 
j=l j=k+l 

where the /3j's are current estimates of the /3j's and sgn{x) is a subgradient of 
The derivative ([3]) is piecewise linear in with breaks at {0,f3j,j 7^ k} unless 
Pk^{0,~^j,3^k}. 
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If there exists a solution to [df{f3)/df3i^) = 0, we can find an interval {01,02) 
which contains it, and further show that the solution is 

j<k j>k 



{\y^Xk - A2(Ej</fc ^jk + J2j>k ^kj) \ - h 



T 



where y = y - T^j^k l^j^j^ and Sjk = sgn{l3j - ^ 
If there is no solution to {df{f3)/df3k) = 0, we let 



Pi if /(/?;) =min{/(0),/(/3,), iorjy^k} 
if fiO)<f0j), for every jV^- 



• Fusion step: 

If the descent step fails to improve the objective function we consider the fusion 
of pairs of (3j^s. For every single pair [k, l),l ^ k, we consider the equality constraint 
Pk — Pi — 1 and try a descent move in 7. The derivative of ([2]) with respect to 7 
becomes 

+2A2 T.j<k,l ^9n{(3j - 7) + 2A2 T.j>k,l ^9n{l - 

where y = y — J2j^k I f^j-^j- optimal value of 7 obtained from the descent step 

improves the objective function, we accept the move f3f, = (3i = •y. 

4.2 Choice of Tuning Parameters 

Estimation of the tuning parameters a and t used in the algorithm above is very important 
for its successful implementation, as it is for the other methods of penalized regression. 
Several methods have been proposed in the literature, and any of these can be used to 
tune the parameters of the HORSES procedure, i^-fold cross-validation (CV) randomly 
divides the data into K roughly equally sized and disjoint subsets Df,, k = 1,...,K; 
[j^^l D)^ = {1, 2, ... , n}. The CV error is defined by 

K f P _ ' ^ 



k^lieDj^ \ j=l 
10 



where f3j is the estimate of f3j for a given a and t using the data set without Df^. 

General i zed c r oss- vahdation (GCV) and Bayesian information criterion (BIG) 



( iTibshirani 



1996 



Tibshirani et aL 



These are defined by 



GGV(a,t) 
BIG(a,t) 



2005 



ZouetaL 



20071 ) are other popular methods. 



RSS(a,t) 



n — df 

n X log (RSS(a;, t)) + logn x df 



where /3j (a, t) is the estimate of /3j for a given a and t, df is the degrees of freedom and 



RSS(a,t) = ^\ yi - ^f^j{a,t)x 



Here, the degrees of freedom is a measure of model complexi ty. To appl y these methods. 



Tibshirani et al 



onemust estimate the degrees of freedom (lEfron et al.l . |2004| ) . Following 
( I2OO5I ) for Fused LASSO, we use the number of distinct groups of non-zero regression 
coefficients as an estimate of the degrees of freedom. 



5 Simulations 



We numerically compare the performance of HORSES and several other penalized methods: 
ridge regression, LASSO, Elastic Net, and OSGAR. We do this by generating data based 
on six models that differ on the number of data points n, number of predictors p, the 
correlation structure S and the true values of the coefficients /3. The parameters for these 
six models are given in Table [TJ 



The first five mode ls are very similar to those in IZou and Hastid (|2005[ ) and 
Bondell and Reich! (120081 ). Specifically, the data are generated from the model 



y = X(3 + e, 



where e ~ N(0,a'^Y For models 1-4, we generate predictors Xi = {xn 
a multivariate normal distribution with mean and covariance S where S 
j = l,...,p. 



, Xip) from 



J J 



1 for 
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Model 


n 


p 




a 


/3 


1 


20 


8 




3 


(3,2,1.5,0,0,0,0,0)^ 


2 


20 


8 




3 


(3,0,0,1.5,0,0,0,0,2)^ 


3 


20 


8 




3 


(0.85, 0.85, 0.85, 0.85.0.85, 0.85, 0.85, 0.85)^ 


4 


iOO 


A n 
4U 


O.O 


io 


{ c\ no on no o\T 

((V^^^^, 2, p, ■ - , 0, 2, . ,2j 


5 


50 


40 


0.5 


15 


10 10 10 10 
(3^_^,0^_^^ 


6 


50 


100 




3 


15 25 
see text 



Table 1: Parameters for the models used in the simulation study. 

For model 5, the predictors are generated as follows: 

x, = Zi + 7]f, Zi ~ iV(0, 1), z G Gi = {1, . . . , 5} 

x, = Z2 + r/f , Z2 ~ iV(0, 1), z e G2 = {6, . . . , 10} 

X, = Z3 + r/f , Z3 ~ iV(0, 1), 2 e G3 = {11, . . . , 15} 
~ A^(0,l),z = 16,..., 40. 

where 77^^ ~ A^(0, 0.16), z = 1, . . . , 15. Then Corr(xj, xj) ^ 0.85 for i,j e for A; = 1, 2, 3. 

For model 6, we consider the scenario where p > n. We choose p = 100 because this 
is the maximum number of predictors that can be handled by the quadratic programming 
used in OSCAR. The vector of coefficients /3 for model 6 is given by 



/3 = (3,._^,0^ 

5 10 



,0,2, 



.,2,0,^,0,-1.5 
"lO 




We generate 100 data sets of size 2n for each of the 6 models. In each data set, the 
final model is estimated as follows: (i) For each (a,t), we use the first n observations as a 
training set to estimate the model and use the other n observations as a validation set to 
compute the prediction error PE(a,t); (ii) We set the tuning parameters to be the values 
(a*, t*) that minimize the prediction error PE(a, t); (iii) The final model is estimated using 
the training set with (a,t) = (a*,t*). 



We compare the mean square error (MS 
ized methods. The MSE is calculated as in 



]) and the mode 



Tibshiranil (119961 ) via 



complexity of the five penal- 



MSE 



(/3-/3)^V(/3-/3), 
12 



where V is the population covariance matrix for X. The model complexity is measured 
by the number of groups. Based on the coefficient values and correlation structure, Table 
1 shows the true number of groups for each of the six scenarios. Note that the true 
number of groups is not always the same as the degrees of freedom. For example, we 
note that the true number of groups in model 5 is three based on the correlation structure 
although all nonzero coefficients have the same value. On the other hand, model 4 assumes 
a compound symmetric covariance structure, therefore the number of groups only depends 
on the coefficient values. Hence, the order of the coefficients does not matter and we can 
consider model 4 as having only one group of non-zero coefficients. We take the model 
complexity of model 6 to be four, based on the coefficient values. However, it is possible 
that some of the zero coefficients might be included as signals because of strong correlations 
and relatively small differences in coefficient values in this case. For example, the correlation 
between f3^Q = 1 and /351 = is 0.7. Therefore it is possible that the true model complexity 
in this case may be bigger than four. 

The simulation results are summarized in Table 2. The HORSES procedure reports the 
smallest dfs except for models 1 and 6. In both scenarios, the differences of df between the 
least complex model and HORSES is marginal (4 vs 5 in model 1 and 30 vs 33.5 in model 
6). The HORSES procedure is also very competitive in the MSE comparison. Its MSE is 
the smallest in models 2-4 and 6 and the second or third smallest in models 1 and 5. 

It is interesting to observe that HORSES is the best in model 2, but third in model 1 
although the differences in MSE and df of Elastic Net and HORSES in model 1 are minor. 
The values of the parameters are the same in both scenarios, but variables with similar 
coefficients are highly correlated in model 1, while these variables have little correlation 
with each other in model 2. Hence we can consider the grouping of predictors as mainly 
determined by coefficient values in model 2 while in model 1, the correlation structure may 
have an important role in the grouping. This can be confirmed by comparing the median 
MSEs of each method in the two models 1 and 2. As expected, the median MSE in model 1 is 

Table 2: The number of groups in each model used in the simulation study. 



Model 


1 2 


3 4 5 


6 


Number of groups 


3 3 


1 1 3 


4 
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always smaller than the median MSE in model 2. The difference in the median MSEs can be 
interpreted as the gain achieved by using the correlation structure when grouping. Because 
of the explicit form of the fusion penalty in HORSES, our procedure seems to give more 
weight to differences among the coefficient values while still accounting for correlations. As 
a result, HORSES effectively groups in model 2. Not surprisingly, the HORSES procedure 
is much more successful than the other procedures in finding the correct model in model 3, 
where it may give higher weight to the fusion penalty (a close to 1). It also has the smallest 
MSE among the methods. In this case, the true model is not sparse and the LASSO and 
Elastic Net methods fail. HORSES outperforms the other methods again in model 4. Since 
the model assumes the compound symmetric covariance structure, the grouping is solely 
based on the coefficient values. Because of the fusion penalty, the HORSES procedure is 
very effective in grouping and produces 3.5 as the median df while the second smallest df is 
15 with OSCAR. In model 5, HORSES has the second smallest median MSE (=46.1) with 
Elastic Net's median MSE smallest at 40.7. However, HORSES chooses the least complex 
model and shows better grouping compared to Elastic Net. Model 6 considers a large p 
and small n case. The HORSES procedure reports the smallest MSE while the Elastic Net 
chooses the least complex model. However we notice that all methods report at least 30 
as the df. This might be due to the fact that the true model complexity in this case is not 
clear, as we point out above. In summary, the HORSES procedure outperforms the other 
methods in choosing the least complex model and attaining the best grouping, while also 
providing competitive results in terms of MSE. 



6 Data Analysis 



6.1 Cookie dough data 

In this case study, we consider the c o okie d o ugh dataset from 



was a lso an alyzed by 



fcoOSh . and [Hand (120 111 ). 



Brown et al. 



(2001), 



Brown et al. 



Osborne et al. 



Griffin and Brown! (120 12h 



1984), which 



Caron and Doucet 



(120011 ) consider four components as response vari- 



ables: percentage o f fat, sucrose, flour and water associated with each dough piece. Fol- 
lowing iHans we attempt to predict only the flour content of cookies with the 300 
NIR reflectance measurements at equally spaced wavelengths between 120 and 2400 nm 
as predictors (out of the 700 in the full data set). Also following iHansI (1201 if ) we remove the 
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Figure 3: Graphical representation of the correlation matrix of the 300 wavelengths of the 
cookie dough data 

23rd and 61st observations as outliers. Then we split the dataset randomly into a training 
set with 39 observations and a test set with 31 observations. Figure |3] shows the correlations 
between NIR reflectance measurements based on all observations. There are very strong 
correlations between any pair of predictors in the range of 1200-2200 and 2200-2400. Note 
however that strong correlations do not necessarily imply strong signals in this case since 
the correlations can be due to the measurement errors. 

With the training data set, tuning parameters of HORSES are computed to be a = 0.999 
and A = 0.1622 (equivalently, Ai = 0.1620 and A2 = 0.00016). Since the Li penalty 
dominates the penalty function, we expect that both HORSES and LASSO will yield very 
similar results. We compare HORSES, LASSO and Elastic Net via the prediction mean 
squared error and degrees of freedoms on the test data. The OSCAR method is not included 
in the comparison because we are not able to apply it due to the high dimension of the data. 
Table 3 presents the prediction mean squared error and degrees of freedom of each method. 
The Elastic Net has the smallest MSE, but the differences in MSE across the three methods 
are small. On the other hand, the LASSO and HORSES methods provide parsimonious 
models with small degrees of freedom. The estimated coefficients for the LASSO, Elastic 
Net and HORSES methods are presented in Figure |H Elastic Net produces 11 peaks while 
both LASSO and HORSES have 7 peaks. The estimated spike s from LASSO and HORSES 



are consistent with the results obtained in ICaron and Doucet 



(j2008[ ). The main difference 



between the two methods is at wavelengths 1832 and 1836, where the LASSO estimates 



15 



(a) Elastic Net (b) LASSO 

HOASES 



(c) HORSES 

Figure 4: Coefficient estimates for the 300 predictors of the cookie dough data 

are 0.204 and while the HORSES estimates are 0.0853 at both wavelengths. The Elastic 
Net has peaks at wavelength 1784 and 1804 but the other two methods do not provide a 
peak at those wavelengths. We observe a reverse pattern at wavelength 2176. 



6.2 Appalachian Mountains Soil Data 



Our next example is the Appalachian Mountains Soil Data from lBondell and ReichI (120081 ). 
Figure [5] shows a graphical representation of the correlation matrix of 15 soil characteristics 
computed from measurements made at twenty 500-m^ plots located in the Appalachian 
Mountains of North Carolina. The data were collected as part of a study on the relationship 
between rich-cove forest diversity and soil characteristics. Forest diversity is measured 
as the number of different plant species found within each plot. The values in the soil 
data set are averages of five equally spaced measurements taken within each plot and are 
standardized before the data analysis. These soil characteristics serve as predictors with 
forest diversity as the response. 

As can be seen from Figure [5l there are several highly correlated predictors. Note that 
our correlation graphic shows the signed correlation values and is thus different from the 
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Figure 5: Graphical representation of the correlation matrix of the 15 predictors of the 
Appalachian soil data 



one m 



Bondell and Reich! ( l2008l ) showing the absolute value of correlation. The first seven 
covariates are closely related. Specifically they concern positively charged ions (cations). 
The predictors named "calcium", "magnesium", "potassium", and "sodium" are all mea- 
surements of cations of the corresponding chemical elements, while "% Base Saturation", 
"Sum Cations" and "CEC" (cation exchange capacity) are all summaries of cation abun- 
dance. The correlations between these seven covariates fall in the range (0.360, 0.999). 
There is a very strong positive correlation between percent base saturation and calcium 
(r = 0.98), but the correlation between potassium and sodium (r = 0.36) is not quite as 
high as the others. Of the remaining eight variables, the strongest negative correlation is 
between soil pH and exchangeable acidity (r = —0.93). Since both of these are measures 
of acidity, this appears surprising. However, exchangeable acidity measures only a subset 
of the acidic ions measured in pH, this subset being of more significance only at low pH 
values. 

Note that because "Sum Cations" is the sum of the other four cation measurements the 
design matrix for these predictors is not full rank. 

We analyze the data with the HORSES and OSCAR procedures and report the results 
in Table 4. Although OSCAR and HORSES use the same definition of df, the OSCAR 
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procedure groups predictors based on the absolute values of the coefficients. Therefore 
the number of groups is not the same as the df in OSCAR. The resul t s for LASSO using 



5-fold cross-validation and GCV can be found in iBondell and ReichI (120081 ). The 5-fold 
cross-validation OSCAR and HORSES solutions are similar. They select the exact same 
variables, but with slightly different coefficient estimates. Since the sample size is only 
20 and the number of predictors is 15, the 5-fold cross-validation method may not be the 
best choice for selecting tuning parameters. However, using GCV, OSCAR and HORSES 
provide different answers. Compared to the 5-fold cross-validation solutions, the OSCAR 
solution has one more predictor (% Base saturation) while the HORSES solution has 3 
additional predictors (% Base saturation. Zinc, Exchangeable acidity). More interestingly, 
in the OSCAR solution, % Base saturation is not in the group measuring abundance of 
cations, while pH is. 

On the other hand, the % Base saturation variable is included in the abundance of 
cations group. The HORSES solution also produces an additional group of variables con- 
sisting of Phosophorus and pH. 



7 Conclusion 

We proposed a new group variable selection procedure in regression that produces a sparse 
solution and also groups positively correlated variables together. We developed a modified 
pathwise coordinate optimization for applying the procedure to data. Our algorithm is 
much faster than a quadratic program solver and can handle cases with p > n. 

Such a procedure is useful relative to other available methods in a number of ways. 
First, it selects groups of variables, rather than randomly selecting one variable in the 
group as the LASSO method does. Second, it groups positively correlated rather than 
both positively and negatively correlated variables. This can be useful when studying the 
mechanisms underlying a process, since the variables within each group behave similarly, 
and may indicate that they measure characteristics that affect a system through the same 
pathways. Third, the penalty function used ensures that the positively correlated variables 
do not need to be spatially close. This is particularly relevant in applications where spatial 
contiguity is not the only indicator of functional relation, such as brain imaging or genetics. 

A simulation study comparing the HORSES procedure with ridge regression, LASSO, 
Elastic Net and OSCAR methods over a variety of scenarios showed its superiority in terms 
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of sparsity, effective grouping of predictors and MSE. 

It is de s irable to achieve a theoretical optimality such as the oracle property of 



Fan and Lil ( 1200 ll ) in hi gh dimensional cases. One possibility is to extend the idea of 



the adaptive Elastic Net ( IZou and Zhangl . 120091 ) to the HORSES procedure. Then we may 



consider the following penalty form: 



/3 = argmin^ II y — 11^ subject to 

P 

j=l j<k 

where wj are the adaptive data-driven weights. 

Investigating theoretical properties of the above estimator will be a topic of future 
research. 



8 Appendix 

Proof of Theorem 1: 

Note that one can write the HORSES optimization problem in the equivalent Lagrangian 
form 

argmin i^\\y - J^f^jXjf + A (^aj^ + (!-«) J] |/3, - /3fc| j | • (4) 

Suppose the covariates {xi,X2, ■ ■ ■ ,Xp) are ordered such that their corresponding coef- 
ficient estimates satisfy 

^1 < ^2 < ■ ■ ■ < < < ^L+l ■ ■ ■ < 

and /3q+i = ■ ■ ■ = /3p = 0. Let 6i, . . . ,6q denote the G unique nonzero values of the set of 
so that G < Q. For each (7 = 1, 2, . . . , G, let 

Gg = {j : = Og} 

denote the set of indices of the covariates whose estimates of regression coefficients are 6g. 
Let also Wg = \Qg\ be the number of elements in the set Qg 
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Suppose that /3fc(Ai,A2) 7^ f^^ii^i^ ^2) ^^(^ both are non-zero. In addtion, let assume 
k & Qg and I ^ Qy^ioi h > g without loss of generality. The differentiation of the objective 
function (jlj) with respect to /J^ gives 

P 

- 2x1 [y^^ ^j^j) + = 0, 

where u+^g = E.gK.g'^.gl and Mg,+ = E.g<.g2^ff2, and 

Kk = a sgn{Pk) + (l - a) (u+,g - ng,+) . (5) 
In the same way, the differentiation of (jl]) with respect to f3i is 

- "^xj {y -^^jXj) + X^^a sgn0i) + (l - a) - m/^ j =0, 
and we have, by taking their differences, 

P ^ 

-^{^k -^I)iy -^t^j^j) + K'^k- 1^1) =0. 

Since X is standardized, — = 2(1 — pf^i). This together with the fact that 

lb - X^||2 < ||y||2 gives 



However, we find that 



Ki-Kf, = a{sgn{Pi) - sgn0k)} 

+ (1 - a) { {u+^h - Uh,+) - {u+,g - %,+) I , (6) 



is always larger than or equal to 2(1 — a). Thus, If 2A ^||?/|| ^2(1 — p^/) < 2(1 — a) 
equivalently, 111/11^/2(1 — p^) j (1 — a) < A - then we encounter a contradiction. 
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Table 3: MSE and model complexity. 



Case 


Method 


MSE 


MSE 


MSE 


DP 


DP 


DP 






Med. 


10th perc. 


90th perc. 


Med. 


10th perc. 


90th perc. 


CI 


Ridge 


2.31 


0.98 


4.25 


8 


8 


8 




LASSO 


1.92 


0.68 


4.02 


5 


3 


8 




Elastic Net 


1.64 


0.49 


3.26 


5 


3 


7.5 




OSCAR 


1.68 


0.52 


3.34 


4 


2 


7 




HORSES 


1.85 


0.74 


4.40 


5 


3 


8 


C2 


Ridge 


2.94 


1.36 


4.63 


8 


8 


8 




LASSO 


2.72 


0.98 


5.50 


5 


3.5 


8 




Elastic Net 


2.59 


0.95 


5.45 


6 


4 


8 




OSCAR 


2.51 


0.96 


5.06 


5 


3 


8 




HORSES 


2.21 


1.03 


4.70 


5 


2 


8 


C3 


Ridge 


1.48 


0.56 


3.39 


8 


8 


8 




LASSO 


2.94 


1.39 


5.34 


6 


4 


8 




Elastic Net 


2.24 


1.02 


4.05 


7 


5 


8 




OSCAR 


1.44 


0.51 


3.61 


5 


2 


7 




HORSES 


0.50 


0.02 


2.32 


2 


1 


5.5 


C4 


Ridge 


27.4 


21.2 


36.3 


40 


40 


40 




LASSO 


45.4 


32 


56.4 


21 


16 


25 




Elastic Net 


34.4 


24 


45.3 


25 


21 


28 




OSCAR 


25.9 


19.1 


38.1 


15 


5 


19 




HORSES 


21.2 


19.3 


33.0 


3.5 


1 


19.5 


C5 


Ridge 


70.2 


41.8 


103.6 


40 


40 


40 




LASSO 


64.7 


27.6 


116.5 


12 


9 


18 




Elastic Net 


40.7 


17.3 


94.2 


17 


13 


25 




OSCAR 


51.8 


14.8 


96.3 


12 


9 


18 




HORSES 


46.1 


18.1 


92.8 


11 


5.5 


19.5 


C6 


Ridge 


27.71 


19.53 


38.53 


100 


100 


100 




LASSO 


13.36 


7.89 


20.18 


31 


24 


39.1 




Elastic Net 


13.57 


8.49 


25.33 


30 


23.9 


37 




OSCAR 


13.16 


8.56 


19.16 


50.00 


35.9 


83.7 




HORSES 


12.20 


7.11 


2322.02 


33.5 


24 


66.3 



Table 4: Biscuit dough data results 





Elastic Net 


HORSES 


LASSO 


Mean Squared Error 


2.442 


2.586 


2.556 


Degrees of Freedom 


11 


7 


7 



Table 5: Results of analyzing the Appalachian soil data using OSCAR and HORSES, and 
two different methods for choosing the tuning parameters. 



Variable 


OSCAR 


OSCAR 


HORSES 


HORSES 




(5-fold CV) 


(GCV) 


(5-fold CV) 


(GCV) 


% Base saturation 





-0.073 





-0.1839 


Sum cations 


-0.178 


-0.174 


-0.1795 


-0.1839 


GEO 


-0.178 


-0.174 


-0.1795 


-0.1839 


Calcium 


-0.178 


-0.174 


-0.1795 


-0.1839 


Magnesium 














Potassium 


-0.178 


-0.174 


-0.1795 


-0.1839 


Sodium 














Phosphorus 


0.091 


0.119 


0.0803 


0.2319 


Copper 


0.237 


0.274 


0.2532 


0.3936 


Zinc 











-0.0943 


Manganese 


0.267 


0.274 


0.2709 


0.3189 


Humic matter 


-0.541 


-0.558 


-0.5539 


-0.6334 


Density 














pH 


0.145 


0.174 


0.1276 


0.2319 


Exchangeable acidity 











0.0185 


Degrees of Freedom 


6 


5 


6 


7 
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